Preparations

Load the necessary libraries

library(car)       #for regression diagnostics
library(broom)     #for tidy output
library(broom.mixed) #for tidy output of mixed models
library(ggfortify) #for model diagnostics
library(sjPlot)    #for outputs
library(knitr)     #for kable
library(effects)   #for partial effects plots
library(emmeans)   #for estimating marginal means
library(ggeffects)  #for partial effects plots
library(MASS)      #for glm.nb
library(MuMIn)     #for AICc
library(nlme)      #for lme and gls
library(lme4)      #for lmer
library(lmerTest)  #for satterthwaite p-values with lmer
library(performance) #for residuals diagnostics
library(see)         #for plotting residuals
#library(pbkrtest)  #for kenward-roger p-values with lmer
library(glmmTMB)   #for glmmTMB
library(DHARMa)   #for residuals and diagnostics
library(tidyverse) #for data wrangling
theme_set(theme_classic())

Scenario

A plant pathologist wanted to examine the effects of two different strengths of tobacco virus on the number of lesions on tobacco leaves. She knew from pilot studies that leaves were inherently very variable in response to the virus. In an attempt to account for this leaf to leaf variability, both treatments were applied to each leaf. Eight individual leaves were divided in half, with half of each leaf inoculated with weak strength virus and the other half inoculated with strong virus. So the leaves were blocks and each treatment was represented once in each block. A completely randomised design would have had 16 leaves, with 8 whole leaves randomly allocated to each treatment.

Tobacco plant

Sampling design

Format of tobacco.csv data files

leaf treat nlegion
1 Strong 35.898
1 Weak 25.02
2 Strong 34.118
2 Weak 23.167
3 Strong 35.702
3 Weak 24.122
... ... ...
leaf The blocking factor - Factor B
treat Categorical representation of the strength of the tobacco virus - main factor of interest Factor A
nlegion Number of lesions on that part of the tobacco leaf - response variable

Read in the data

tobacco <- read_csv('../data/tobacco.csv', trim_ws=TRUE) %>%
  janitor::clean_names() %>% 
  rename(nlegion = number, treat = treatment) %>%
  mutate(leaf = factor(leaf), treat = fct_rev(treat))
glimpse(tobacco)
## Rows: 16
## Columns: 3
## $ leaf    <fct> L1, L1, L2, L2, L3, L3, L4, L4, L5, L5, L6, L6, L7, L7, L8, L8
## $ treat   <fct> Strong, Weak, Strong, Weak, Strong, Weak, Strong, Weak, Stron…
## $ nlegion <dbl> 35.89776, 25.01984, 34.11786, 23.16740, 35.70215, 24.12191, 2…

Exploratory data analysis

Model formula: \[ y_i \sim{} \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i =\boldsymbol{\beta} \bf{X_i} + \boldsymbol{\gamma} \bf{Z_i} \]

where \(\boldsymbol{\beta}\) and \(\boldsymbol{\gamma}\) are vectors of the fixed and random effects parameters respectively and \(\bf{X}\) is the model matrix representing the overall intercept and effects of the treatment on the number of lesions. \(\bf{Z}\) represents a cell means model matrix for the random intercepts associated with leaves.

We will start by explicitly declaring the categorical variable (treat) as a factor. In addition, random effects (in this case leaf) should also be declared as factors.

# already done above
# tobacco = tobacco %>%
#   mutate(leaf=factor(leaf),
#          treat=factor(treat))

To explore the assumptions of homogeneity of variance and normality, a boxplot of each Treatment level is appropriate.

ggplot(tobacco,  aes(y=nlegion,  x=treat)) +
  geom_boxplot() +
  labs(x = "Treatment", y = "Lesion count")

This suggests that we can get away with a Gaussian distribution, despite the two outliers in the Strong treatment.

Conclusions:

  • both normality and homogeneity of variance seem satisfied

It can also be useful to get a sense of the consistency across blocks (leaf). That is, do all leaves have a similar baseline level of lesion susceptibility and do they respond similarly to the treatment.

Another good plot for random effects across treatments:

ggplot(tobacco,  aes(y=nlegion,  x=as.numeric(leaf))) +
  geom_line(aes(linetype=treat))

## If we want to retain the original leaf labels
ggplot(tobacco,  aes(y=nlegion,  x=as.numeric(leaf))) +
  geom_blank(aes(x=leaf)) +
  geom_line(aes(linetype=treat))

Conclusions:

  • it is clear that some leaves are more susceptible to lesions (e.g. Leaf 7) than other leaves (e.g. Leaf 4)
  • most leaves (other than Leaf 4 and 6) have a similar response to the Treatments - that is most have higher number of lesions from the Strong Treatment than the Weak Treatment.

Given that there are only two levels of Treatment (Strong and Weak), it might be easier to visualise the differences in baselines and effect consistency by plotting as:

ggplot(tobacco,  aes(y=nlegion,  x=treat,  group=leaf)) +
  geom_point() +
  geom_line(aes(x=as.numeric(treat))) 

‘group’ works similarly to ‘by’; group is just more versatile when it comes to mapping because it automatically groups observations into polygons/lines, whereas by needs a named list as input. Otherwise, they work very similarly.

Conclusions:

  • this figure reiterates the points made earlier about the varying baselines and effect consistency.

The above figure also serves as a good way to visualise certain aspects of mixed effects models. When we fit a mixed effects model that includes a random blocking effect (in this case leaf), we are allowing there to be a different intercept for each block (leaf). In the current case, the intercept will represent the first Treatment level (Weak, since the factor levels were reversed when reading in the data). So the random effect specifies that the intercept can vary from Leaf to Leaf.

We can think of the model as having two tiers (a hierarchy), where the tiers of the hierarchy represent progressively smaller (typically) spatial scales. In the current example, the largest spatial units are the leaves (blocking factor). Within the leaves, there are the two Treatments (Strong and Weak) and within the Treatments are the individual observations.

Note: We can allow this, and the model will also tell us if it is unnecessary. If all the lines were parallel, we would only need random intercepts.

We tend to represent this hierarchy upside down in the model formula:

\[ y_i \sim{} \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i =\beta_0 + \boldsymbol{\beta} \bf{X_i}\\ \beta_0 = \boldsymbol{\gamma} \bf{Z_i} \]

In addition to allowing there to be a different intercept per leaf, we could allow there to be a different magnitude of effect (difference between Strong and Weak Treatment) per leaf. That is considered a random slope. From the figure above, there is some evidence that the effects (slopes) may vary from Leaf to Leaf.

Incorporating a random slope (in addition to a random intercept), may reduce the amount of unexplained variance and thus improve the power of the main effect (Treatment effect).

Fit the model

Note on ML optimizers

The default optimizer (nlminb) for maximising the likelihood and restricted likelihood is designed for speed, but doesn’t always converge. With small data sets such as this, we may need to use a different optimizer to find the maximum likelihood.

Random intercept model:

tobacco_lmm1 <- glmmTMB(nlegion ~ treat + (1|leaf), data=tobacco,
                       REML=T)

(1|leaf) is interpreted as ‘do the intercept [‘1’] by [‘|’] leaf as a random effect’

REML (residual, or restricted, maximum likelihood) is required when comparing models with different random effects structures (but the same fixed effects).
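Conversely, if we were comparing models with different fixed effects structures, we would refit with maximum likelihood (ML) instead; a minimal sketch:

```r
## Refit with ML (REML = FALSE) - appropriate when comparing
## models that differ in their fixed effects
tobacco_lmm1_ml <- glmmTMB(nlegion ~ treat + (1|leaf),
                           data = tobacco, REML = FALSE)
```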

Random slope and intercept model:

tobacco_lmm2 <- glmmTMB(nlegion ~ treat + ((1 + treat)|leaf), 
                        data=tobacco, REML=T); AICc(tobacco_lmm2)
## [1] 110.0038
# equivalent formulation:
tobacco_lmm2 <- glmmTMB(nlegion ~ treat + (treat|leaf), 
                        data=tobacco, REML=T); AICc(tobacco_lmm2)
## [1] 110.0038

((1 + treat)|leaf), also written as (1 + treat|leaf) or (treat|leaf), is interpreted as ‘do the intercept [‘1’] and slope [‘treat’] by [‘|’] leaf as a random effect’
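To see what the random slope adds, we can extract the estimated random-effects standard deviations and their correlation directly:

```r
## Random-effects (co)variance estimates for the
## random slope and intercept model
VarCorr(tobacco_lmm2)
```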

If the model failed to converge, we can use a different optimization algorithm. Here is how to change it:

better_opt <- glmmTMBControl(optimizer = optim,
               optArgs = list(method = "Nelder-Mead"))
tobacco_lmm2 <- glmmTMB(nlegion ~ treat + (treat|leaf), 
                        data=tobacco, REML=T,
                        control = better_opt)
tobacco_lmm2
## Formula:          nlegion ~ treat + (treat | leaf)
## Data: tobacco
##       AIC       BIC    logLik  df.resid 
## 100.67050 105.30603 -44.33525        12 
## Random-effects (co)variances:
## 
## Conditional model:
##  Groups   Name        Std.Dev. Corr  
##  leaf     (Intercept) 3.9701         
##           treatStrong 0.8194   -0.65 
##  Residual             3.7587         
## 
## Number of obs: 16 / Conditional model: leaf, 8
## 
## Dispersion estimate for gaussian family (sigma^2): 14.1 
## 
## Fixed Effects:
## 
## Conditional model:
## (Intercept)  treatStrong  
##      27.061        7.879

Nelder-Mead is quite robust, but much slower. If there are still convergence problems, there is likely an issue with the model we are fitting (too complex for the data available), rather than with the optimizer!

Random slope model:

tobacco_lmm3 <- glmmTMB(nlegion ~ treat + ((-1 + treat)|leaf), 
                        data=tobacco, REML=T,
                        control = better_opt)
## Warning in fitTMB(TMBStruc): Model convergence problem; non-positive-definite
## Hessian matrix. See vignette('troubleshooting')

Use AIC to determine the optimal random effect structure

AICc(tobacco_lmm1, tobacco_lmm2, tobacco_lmm3)

The random intercept model (lowest AICc) is preferred over the random slope and intercept model.
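An alternative way to compare the two nested random effects structures is a likelihood-ratio test; note that testing a variance component on the boundary of its parameter space makes the resulting p-value conservative:

```r
## Likelihood-ratio test of the nested random-effects structures
## (p-values for boundary variance components are conservative)
anova(tobacco_lmm1, tobacco_lmm2)
```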

Model validation

plot_model(tobacco_lmm1, type = 'diag')[-2] %>% plot_grid()

plot_model(tobacco_lmm1, type = 'eff')
## $treat

ggemmeans(tobacco_lmm1, ~treat) %>% plot(add.data=T)
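Since the DHARMa package is already loaded, we can also inspect simulation-based residuals as a complement to the diagnostics above:

```r
## Simulation-based residual diagnostics via DHARMa:
## simulate from the fitted model and compare to observed residuals
tobacco_resids <- simulateResiduals(tobacco_lmm1, plot = TRUE)
## formal test for over/under-dispersion
testDispersion(tobacco_resids)
```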

summary(tobacco_lmm1)
##  Family: gaussian  ( identity )
## Formula:          nlegion ~ treat + (1 | leaf)
## Data: tobacco
## 
##      AIC      BIC   logLik deviance df.resid 
##     96.7     99.8    -44.4     88.7       14 
## 
## Random effects:
## 
## Conditional model:
##  Groups   Name        Variance Std.Dev.
##  leaf     (Intercept) 13.63    3.692   
##  Residual             14.46    3.803   
## Number of obs: 16, groups:  leaf, 8
## 
## Dispersion estimate for gaussian family (sigma^2): 14.5 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   27.061      1.874  14.440  < 2e-16 ***
## treatStrong    7.879      1.901   4.144 3.42e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To change the covariance matrix to a correlation matrix:

cov2cor(vcov(tobacco_lmm1)$cond)
##             (Intercept) treatStrong
## (Intercept)   1.0000000  -0.5073298
## treatStrong  -0.5073298   1.0000000

Now, we can see that the correlation between the intercept (Weak) and treatStrong parameter estimates is -0.51.

To get confidence intervals:

tidy(tobacco_lmm1, conf.int=T)
# for just fixed effects:
tidy(tobacco_lmm1, effects = 'fixed', conf.int=T)
# for just random effects:
tidy(tobacco_lmm1, effects = 'ran_pars', conf.int=T)

To get how different each intercept is:

(leaf_re <- tobacco_lmm1 %>% ranef %>% unlist)
## cond.leaf.(Intercept)1 cond.leaf.(Intercept)2 cond.leaf.(Intercept)3 
##             -0.3539126             -1.5406161             -0.7111775 
## cond.leaf.(Intercept)4 cond.leaf.(Intercept)5 cond.leaf.(Intercept)6 
##             -2.5604485             -1.2399299              3.5023104 
## cond.leaf.(Intercept)7 cond.leaf.(Intercept)8 
##              5.6216063             -2.7178321
mean(leaf_re) # essentially zero
## [1] -2.081668e-17
sd(leaf_re)
## [1] 2.984558
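Equivalently, coef() returns the per-leaf intercepts, combining the fixed effects with each leaf’s random deviation:

```r
## Per-leaf coefficients: fixed effects plus the
## leaf-level random intercepts
coef(tobacco_lmm1)$cond$leaf
```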

To get pseudo-r^2:

MuMIn::r.squaredGLMM(tobacco_lmm1)
##            R2m       R2c
## [1,] 0.3707882 0.6761025

The first is the marginal R^2 (fixed effects only); the second is the conditional R^2 (fixed and random effects together).

We can also use the performance package:

performance::r2_nakagawa(tobacco_lmm1)
## # R2 for Mixed Models
## 
##   Conditional R2: 0.676
##      Marginal R2: 0.371

Partial plots

tobacco_lmm1 %>% emmeans(~treat) %>%
  as.data.frame() %>%
  rename(nlegion = emmean, lwr = lower.CL, upr = upper.CL) %>%
  ggplot(aes(x = treat, y = nlegion)) +
  geom_pointrange(aes(ymin = lwr, ymax = upr)) +
  geom_jitter(data = tobacco, col = "red", height = 0, width = 0.1) +
  labs(x = "Treatment", y = "Number of lesions")

Model investigation / hypothesis testing
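As a sketch of what belongs in this section, the Treatment effect can be examined via pairwise contrasts of the estimated marginal means:

```r
## Pairwise contrast (Weak vs Strong) of the estimated marginal means,
## with a confidence interval on the difference
tobacco_lmm1 %>%
  emmeans(~treat) %>%
  pairs() %>%
  confint()
```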

Predictions
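A sketch of generating population-level predictions, with the leaf random effects set to zero (leaf is supplied as NA so the model frame can still be built):

```r
## Population-level predictions for each treatment
## (re.form = NA excludes the leaf random effects)
newdata <- data.frame(treat = factor(c("Weak", "Strong"),
                                     levels = levels(tobacco$treat)),
                      leaf = NA)
predict(tobacco_lmm1, newdata = newdata, re.form = NA, se.fit = TRUE)
```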

Summary figures

References